Together AI Tutorial: From Basics to Advanced RAG
Welcome to this intermediate tutorial on Together AI! 🚀
Together AI is a cloud platform providing access to a vast array of open-source generative AI models. It's known for its high-performance inference, fine-tuning capabilities, and developer-friendly tools. This notebook will expand on the basics and guide you through building a more robust Retrieval-Augmented Generation (RAG) system, showcasing key features like high-quality embeddings and reranking.
1. Setup
First, let's install the necessary libraries. We'll be using numpy for numerical operations and python-dotenv to manage our API keys securely.
# Uncomment the following line to install the required packages
# %pip install together python-dotenv numpy
Next, create a file named .env in the same directory as this notebook and add your Together AI API key:
TOGETHER_API_KEY="your_api_key_here"
Now, let's load the API key and initialize the client.
import os
from dotenv import load_dotenv
import together
load_dotenv()
client = together.Together(api_key=os.environ.get("TOGETHER_API_KEY"))
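Optionally, you can verify that the key and client are working with a one-off chat completion before building anything else. This is just a quick sanity-check sketch; it reuses the same chat model we call later in this notebook.
# Quick sanity check: a single chat completion to confirm the API key works.
test_response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(test_response.choices[0].message.content)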
2. A Quick Look at Together AI's Features
Before diving into RAG, it's worth noting some of Together AI's standout features:
- Vast Model Library: Access to over 100 open-source models for various tasks (chat, code, image, embeddings).
- High Performance: Optimized inference stack for fast response times.
- OpenAI Compatibility: The Python client is designed to be a drop-in replacement for the OpenAI client, making migration easy (see the sketch after this list).
- Serverless Endpoints: Pay-as-you-go access to models without managing infrastructure.
- Fine-tuning & Reranking: Tools to customize models and improve information retrieval.
3. Building an Advanced RAG System
Retrieval-Augmented Generation (RAG) enhances a large language model's responses by providing it with relevant information from an external knowledge base. This reduces hallucinations and allows the model to answer questions about specific, up-to-date, or private data.
Our RAG pipeline will involve:
- Preparing a Knowledge Base: We'll create and process a text file.
- Indexing: We'll use a high-quality embedding model to create vector representations of our data.
- Retrieval: We'll find the most relevant documents for a given query.
- Reranking: We'll use a reranker model to refine the search results.
- Generation: We'll generate a final answer based on the retrieved and reranked information.
Step 1: Prepare the Knowledge Base
Let's create a small knowledge base about Together AI and save it to a file.
knowledge_base_content = """
Together AI offers a cloud platform for building and running generative AI. It provides access to over 100 open-source models.
The platform is designed for high-performance inference, leveraging techniques like speculative decoding.
For Retrieval-Augmented Generation, Together AI offers both embedding models and reranker models.
The BAAI/bge-large-en-v1.5 is a popular and powerful model for generating text embeddings.
Reranking is a crucial step in a RAG pipeline to improve the quality of retrieved documents before sending them to the language model.
The Together AI Python client is compatible with the OpenAI API, making it easy for developers to switch.
Users can fine-tune models on their own data to create specialized, expert models.
Together AI offers both serverless, pay-as-you-go endpoints and dedicated instances for large-scale applications.
"""
with open("knowledge_base.txt", "w") as f:
f.write(knowledge_base_content)
# Now, we'll load the text and split it into chunks (in this case, by line).
with open("knowledge_base.txt", "r") as f:
knowledge_base = [line.strip() for line in f.readlines() if line.strip()]
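Splitting by line is fine for this toy file, but real documents usually need a more deliberate chunking strategy. Below is a minimal, library-free sketch of fixed-size character chunks with overlap; the chunk and overlap sizes are arbitrary assumptions you would tune for your data.
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping character windows (hypothetical sizes)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

# For a larger document you could use:
# knowledge_base = chunk_text(open("knowledge_base.txt").read())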
Step 2: Indexing with a High-Quality Embedding Model
We'll use BAAI/bge-large-en-v1.5, a top-performing embedding model available on Together AI, to convert our text chunks into vectors.
embedding_model = "BAAI/bge-large-en-v1.5"
embeddings_response = client.embeddings.create(
model=embedding_model,
input=knowledge_base,
)
document_embeddings = [embedding.embedding for embedding in embeddings_response.data]
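The embeddings endpoint accepts a list of inputs, so a single call is enough for a knowledge base this small. For larger corpora you may prefer to send the texts in smaller batches; here is a minimal sketch (the batch size of 64 is an arbitrary assumption, not a documented limit).
def embed_in_batches(texts, model, batch_size=64):
    """Embed a list of texts in fixed-size batches to keep each request small."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors

# document_embeddings = embed_in_batches(knowledge_base, embedding_model)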
Step 3: Retrieval
Next, we'll embed the user's query and use cosine similarity to find the most relevant chunks from our knowledge base.
import numpy as np
from numpy.linalg import norm
def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    return np.dot(a, b) / (norm(a) * norm(b))
user_query = "How can I improve my RAG system?"
top_k = 3
query_embedding_response = client.embeddings.create(
model=embedding_model,
input=[user_query],
)
query_embedding = query_embedding_response.data[0].embedding
similarities = [cosine_similarity(query_embedding, doc_embedding) for doc_embedding in document_embeddings]
top_indices = np.argsort(similarities)[-top_k:][::-1]
retrieved_documents = [knowledge_base[i] for i in top_indices]
print("Retrieved documents (before reranking):")
for doc in retrieved_documents:
print(f"- {doc}")
Retrieved documents (before reranking):
- Reranking is a crucial step in a RAG pipeline to improve the quality of retrieved documents before sending them to the language model.
- Users can fine-tune models on their own data to create specialized, expert models.
- The BAAI/bge-large-en-v1.5 is a popular and powerful model for generating text embeddings.
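For a knowledge base of only a few lines, the Python loop above is fine. At larger scale, the per-document loop can be replaced by a single vectorized computation; a minimal sketch, assuming the document embeddings are stacked into a NumPy matrix:
# Vectorized retrieval: one matrix-vector product instead of a Python loop.
doc_matrix = np.array(document_embeddings)           # shape: (num_docs, dim)
query_vec = np.array(query_embedding)

# Normalize rows and the query so a dot product equals cosine similarity.
doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
query_norm = query_vec / np.linalg.norm(query_vec)

all_similarities = doc_norms @ query_norm
vectorized_top_indices = np.argsort(all_similarities)[-top_k:][::-1]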
Step 4: Reranking for Quality
While cosine similarity is a good first-pass retrieval signal, it isn't perfect. A reranker model can take the retrieved documents and re-order them based on a more nuanced understanding of relevance to the query. This is a key feature for building production-quality RAG systems.
rerank_model = "mixedbread-ai/Mxbai-Rerank-Large-V2"
rerank_response = client.rerank.create(
model=rerank_model,
query=user_query,
documents=retrieved_documents,
)
reranked_indices = [result.index for result in rerank_response.results]
reranked_documents = [retrieved_documents[i] for i in reranked_indices]
print("\nReranked documents:")
for doc in reranked_documents:
print(f"- {doc}")
Reranked documents:
- The BAAI/bge-large-en-v1.5 is a popular and powerful model for generating text embeddings.
- Users can fine-tune models on their own data to create specialized, expert models.
- Reranking is a crucial step in a RAG pipeline to improve the quality of retrieved documents before sending them to the language model.
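The rerank response typically carries a relevance score for each document as well, which is useful for filtering out weak matches before generation. A minimal sketch, assuming each result exposes a relevance_score field alongside its index (check the response object in your client version), with a hypothetical threshold you would tune:
# Keep only documents the reranker scores above a (hypothetical) threshold.
score_threshold = 0.2  # assumption: tune for your data
filtered_documents = [
    retrieved_documents[result.index]
    for result in rerank_response.results
    if getattr(result, "relevance_score", 0.0) >= score_threshold
]
print(filtered_documents)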
Step 5: Generation
Finally, we'll combine the query and the top reranked document into a prompt and send it to a powerful chat model to generate a comprehensive answer.
context = reranked_documents[0] # Use the top reranked document
prompt = f"""
Context: {context}
Question: {user_query}
Based on the provided context, give a concise answer.
Answer:
"""
response = client.chat.completions.create(
model="meta-llama/Llama-3-8b-chat-hf",
messages=[{"role": "user", "content": prompt}],
temperature=0.7,
)
print(f"\nFinal Answer:\n{response.choices[0].message.content}")
Final Answer:
To improve your RAG (Reactor Alignment Generator) system, consider fine-tuning the BAAI/bge-large-en-v1.5 model on your specific task and dataset to adapt its text embeddings to your use case. This can be done using a technique like masked language modeling or sentence similarity tasks to adjust the model's output to better suit your requirements.
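Here we used only the single best document as context. A common variant is to concatenate several of the reranked chunks into the prompt so the model has more grounding to draw on; a minimal sketch of that approach using the same chat model:
# Build the context from all reranked documents instead of just the top one.
combined_context = "\n".join(f"- {doc}" for doc in reranked_documents)

multi_doc_prompt = f"""
Context:
{combined_context}

Question: {user_query}

Based on the provided context, give a concise answer.
Answer:
"""

multi_doc_response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[{"role": "user", "content": multi_doc_prompt}],
    temperature=0.7,
)
print(multi_doc_response.choices[0].message.content)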
4. Conclusion
This tutorial has demonstrated how to build a more sophisticated RAG system using Together AI. We've leveraged key features like high-quality embedding models and rerankers to improve the accuracy of our information retrieval. This approach significantly enhances the capabilities of large language models by grounding them in factual, external knowledge.
From here, you can explore more advanced topics such as:
- Fine-tuning: Train a model on your own data for specialized tasks.
- Larger Knowledge Bases: Integrate with vector databases like Pinecone or MongoDB for scalable RAG.
- Different Models: Experiment with the wide variety of models available on the Together AI platform.